Add cache-aware placement affinity by ValentaTomas · Pull Request #2867 · e2b-dev/infra

ValentaTomas · 2026-05-30T01:54:29Z

Add disabled-by-default cache-aware placement affinity using recent successful build placements. Redis state is bounded by TTL/top-N limits and is best-effort.

Record recent successful placements and use them as a small, gated best-of-K bias so warm nodes are preferred without bypassing placement filters.
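A minimal sketch of how such a bias might work, assuming a lower-is-better load score. The names `candidate`, `pickBest`, and `bonus` are hypothetical and not taken from the PR; the actual scoring lives in `placement_best_of_K.go` and may differ.

```go
package main

import "fmt"

// candidate is a hypothetical stand-in for a node that has already passed
// every placement filter (status, CPU compatibility, labels).
type candidate struct {
	ID   string
	Load float64 // normalized load score; lower is better
}

// pickBest returns the ID of the lowest-scoring candidate among the K sampled
// nodes. bonus maps node ID to a small affinity bonus that is subtracted from
// the load score; a nil or empty map leaves the ranking unchanged.
func pickBest(sampled []candidate, bonus map[string]float64) string {
	best := ""
	bestScore := 0.0
	for i, c := range sampled {
		score := c.Load - bonus[c.ID] // missing keys yield 0
		if i == 0 || score < bestScore {
			best, bestScore = c.ID, score
		}
	}
	return best
}

func main() {
	sampled := []candidate{{"node-a", 0.40}, {"node-b", 0.42}, {"node-c", 0.90}}
	fmt.Println(pickBest(sampled, nil))                                // node-a
	fmt.Println(pickBest(sampled, map[string]float64{"node-b": 0.05})) // node-b
	fmt.Println(pickBest(sampled, map[string]float64{"node-c": 0.05})) // node-a
}
```

Because the bonus only adjusts the score of nodes that were already sampled, a warm node can win a close comparison but cannot override a large load gap or re-enter after being filtered out.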

cursor · 2026-05-30T01:54:35Z

PR Summary

Medium Risk
Changes core sandbox scheduling and node selection; mis-tuned affinity or exhausted-node retry loops could skew load or add latency, though the feature is off by default and Redis updates are best-effort.

Overview
Adds disabled-by-default cache-aware sandbox placement: recent successful placements per build are stored in Redis (bounded TTL and top-N) and used as a score bonus in Best-of-K scheduling so repeat runs favor nodes that likely already have the image cached. Resume skips build affinity so it does not fight node pinning; after a successful create, affinity is recorded best-effort in the background.

Placement now treats resource-exhausted nodes separately from hard failures: exhausted nodes can be retried after a short backoff instead of counting only toward the fixed retry limit. CreateSandbox can also trigger on-demand node discovery when the cluster has no connected nodes (including a local orchestrator fallback).

Reviewed by Cursor Bugbot for commit b9916c4. Bugbot is set up for automated code reviews on this repo.

codecov · 2026-05-30T01:55:23Z

❌ 2 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|---|---|---|---|
| 2703 | 2 | 2701 | 7 |

View the full list of 2 ❄️ flaky test(s)

github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 57.71% (Passed 740 times, Failed 1010 times)

Stack Traces | 59.7s run time

=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:27: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (59.74s)

github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 57.83% (Passed 730 times, Failed 1001 times)

Stack Traces | 201s run time

=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1260}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\n"}}
Executing command bash in sandbox icffsn677ic4q2ot80wdi (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory before tmpfs mount: 191 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Free memory before tmpfs mount: 793 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Memory to use in integrity test (60% of free, min 64MB): 475 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"475+0 records in\n475+0 records out\n498073600 bytes (498 MB, 475 MiB) copied, 1.85918 s, 268 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=475\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 1.84\n\tPercent of CPU this job got: 98%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:01.86\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set size (kbytes): 2684\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 3\n\tMinor (reclaiming a frame) page faults: 344\n\tVoluntary context switches: 4\n\tInvoluntary context switches: 32\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 671 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox ii54zzuzbbebmn4txkqsd
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{start:{pid:1276}}
Executing command bash in sandbox ic487cegdob9yno4xisvr (user: root)
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{data:{stdout:"f957f56d0c5dd75d69f891a81e2182c030dcf2df060e08b6e3d426f329c9f4bc\n"}}
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:80: Command [bash] completed successfully in sandbox ii54zzuzbbebmn4txkqsd
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{start:{pid:1280}}
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
    sandbox_memory_integrity_test.go:110: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:81
        	            				.../hostedtoolcache/go/1.26.3.../src/runtime/asm_amd64.s:1771
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox ii54zzuzbbebmn4txkqsd: unavailable: HTTP status 502 Bad Gateway
    sandbox_memory_integrity_test.go:110: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:78
        	            				.../tests/orchestrator/sandbox_memory_integrity_test.go:110
        	Error:      	Condition never satisfied
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (201.11s)

gemini-code-assist

Code Review

Creating and executing a Redis pipeline inside the loop in record defeats the purpose of pipelining and results in multiple sequential round-trips. Similarly, performing sequential Redis queries inside the loop in scores on the critical path of sandbox creation introduces unnecessary latency; both should be batched using a single pipeline outside the loops to improve performance.
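To illustrate the round-trip argument, here is a small sketch using a hypothetical `fakeStore` that counts round-trips. It is not a Redis client and makes no claim about the client API used in the PR; it only contrasts per-key access inside a loop with a single batched call.

```go
package main

import "fmt"

// fakeStore is a hypothetical stand-in for a remote key-value store; it
// counts round-trips so the two access patterns can be compared.
type fakeStore struct {
	data       map[string]float64
	roundTrips int
}

// get costs one round-trip per key.
func (s *fakeStore) get(key string) float64 {
	s.roundTrips++
	return s.data[key]
}

// getMany costs one round-trip for the whole batch, in the way commands
// queued on a single pipeline are sent and answered together.
func (s *fakeStore) getMany(keys []string) []float64 {
	s.roundTrips++
	out := make([]float64, len(keys))
	for i, k := range keys {
		out[i] = s.data[k]
	}
	return out
}

func main() {
	keys := []string{"build:1", "build:2", "build:3"}
	s := &fakeStore{data: map[string]float64{"build:1": 1, "build:3": 3}}

	for _, k := range keys {
		s.get(k)
	}
	fmt.Println("per-key round-trips:", s.roundTrips) // 3

	s.roundTrips = 0
	s.getMany(keys)
	fmt.Println("batched round-trips:", s.roundTrips) // 1
}
```

On the sandbox-creation path the difference matters because the per-key cost grows with the number of keys, while the batched cost stays at one network round-trip.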

Use a single pipeline for affinity reads and writes to keep the placement path bounded when the flag is enabled.

Use error-aware Redis checks and apply the repository formatter to the new feature flag.

Move cache affinity rollout knobs into the existing flag so Redis TTL, scoring, limits, and timeouts can be adjusted without a deploy.

Keep resume origin-node affinity as the primary signal and skip per-snapshot build keys so the Redis placement cache stays bounded.

Avoid Redis affinity lookups when resume already has a ready origin node and skip disabled affinity dimensions.

Use team-scoped template and base-template affinity as a low-cardinality fallback for resumes when origin-node placement is unavailable.

Track team-level placement history as the weakest cache-affinity fallback alongside template and base-template signals.

Keep template/team dimensions available for experiments but disable them by default until base build metadata is available.

Remove proxy affinity dimensions so the first rollout only uses concrete build cache signals.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Affinity node repeatedly selected despite ResourceExhausted failures
- Added the ResourceExhausted node to nodesExcluded to prevent affinity augmentation from re-selecting it on retry.

Or push these changes by commenting:

@cursor push bbadc5591a

Preview (bbadc5591a)

diff --git a/packages/api/internal/orchestrator/placement/placement.go b/packages/api/internal/orchestrator/placement/placement.go
--- a/packages/api/internal/orchestrator/placement/placement.go
+++ b/packages/api/internal/orchestrator/placement/placement.go
@@ -102,6 +102,7 @@
 
 		switch statusCode {
 		case codes.ResourceExhausted:
+			nodesExcluded[failedNode.ID] = struct{}{}
 			failedNode.PlacementMetrics.Skip(sbxRequest.GetSandbox().GetSandboxId())
 			logger.L().Warn(ctx, "Node exhausted, trying another node", logger.WithSandboxID(sbxRequest.GetSandbox().GetSandboxId()), logger.WithNodeID(failedNode.ID))
 		default:

Exclude nodes that reject placement as exhausted so affinity retries can fall through to other candidates.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: ResourceExhausted tight retry loop on last remaining node
- Added an explicit check that exits the retry loop with a 'no nodes available' error when the last remaining node returns ResourceExhausted, preventing a tight retry loop that would otherwise spin until the context times out.

Or push these changes by commenting:

@cursor push 7334edb433

Preview (7334edb433)

diff --git a/packages/api/internal/orchestrator/placement/placement.go b/packages/api/internal/orchestrator/placement/placement.go
--- a/packages/api/internal/orchestrator/placement/placement.go
+++ b/packages/api/internal/orchestrator/placement/placement.go
@@ -104,6 +104,9 @@
 		case codes.ResourceExhausted:
 			if len(nodesExcluded)+1 < len(clusterNodes) {
 				nodesExcluded[failedNode.ID] = struct{}{}
+			} else {
+				// All nodes are exhausted; exit the retry loop.
+				return nil, errors.New("no nodes available")
 			}
 			failedNode.PlacementMetrics.Skip(sbxRequest.GetSandbox().GetSandboxId())
 			logger.L().Warn(ctx, "Node exhausted, trying another node", logger.WithSandboxID(sbxRequest.GetSandbox().GetSandboxId()), logger.WithNodeID(failedNode.ID))

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 38d8022e3b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 75d9d736d4

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Affinity bonus not normalized to placement score scale
- Normalized the affinity bonus by dividing it by totalCapacity to ensure consistent relative impact across nodes of different sizes.

Or push these changes by commenting:

@cursor push 878be92bc5

Preview (878be92bc5)

diff --git a/packages/api/internal/orchestrator/placement/placement_best_of_K.go b/packages/api/internal/orchestrator/placement/placement_best_of_K.go
--- a/packages/api/internal/orchestrator/placement/placement_best_of_K.go
+++ b/packages/api/internal/orchestrator/placement/placement_best_of_K.go
@@ -63,7 +63,7 @@
 
 	score := (cpuRequested + float64(reserved) + config.Alpha*usageAvg) / totalCapacity
 	if len(affinityScores) > 0 {
-		score -= affinityScores[0][node.ID]
+		score -= affinityScores[0][node.ID] / totalCapacity
 	}
 
 	return score

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Exhausted nodes never cleared when hard failure fills threshold
- Added logic to check and clear nodesExhausted before returning 'no nodes available', ensuring capacity-exhausted nodes are retried even when a hard failure fills the threshold.

Or push these changes by commenting:

@cursor push 61822754ac

Preview (61822754ac)

diff --git a/packages/api/internal/orchestrator/placement/placement.go b/packages/api/internal/orchestrator/placement/placement.go
--- a/packages/api/internal/orchestrator/placement/placement.go
+++ b/packages/api/internal/orchestrator/placement/placement.go
@@ -62,7 +62,12 @@
 				skip[id] = struct{}{}
 			}
 			if len(skip) >= len(clusterNodes) {
-				return nil, errors.New("no nodes available")
+				if len(nodesExhausted) > 0 {
+					clear(nodesExhausted)
+					attempt++
+				} else {
+					return nil, errors.New("no nodes available")
+				}
 			}
 
 			node, err = algorithm.chooseNode(ctx, clusterNodes, skip, nodemanager.SandboxResources{CPUs: sbxRequest.GetSandbox().GetVcpu(), MiBMemory: sbxRequest.GetSandbox().GetRamMb()}, buildMachineInfo, labelFilteringEnabled, requiredLabels, affinityScores...)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8a6080cc64

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Exhausted-pool retry never triggers with ineligible cluster nodes
- Fixed by counting only eligible nodes (passing status, CPU, and label checks) instead of all cluster nodes when determining if the exhausted pool should be retried.

Or push these changes by commenting:

@cursor push 6e8cf6cf45

Preview (6e8cf6cf45)

diff --git a/packages/api/internal/orchestrator/placement/placement.go b/packages/api/internal/orchestrator/placement/placement.go
--- a/packages/api/internal/orchestrator/placement/placement.go
+++ b/packages/api/internal/orchestrator/placement/placement.go
@@ -10,8 +10,11 @@
 	"google.golang.org/grpc/codes"
 	"google.golang.org/grpc/status"
 
+	"github.com/e2b-dev/infra/packages/api/internal/api"
 	"github.com/e2b-dev/infra/packages/api/internal/orchestrator/nodemanager"
 	"github.com/e2b-dev/infra/packages/api/internal/utils"
+	"github.com/e2b-dev/infra/packages/shared/pkg/consts"
+	"github.com/e2b-dev/infra/packages/shared/pkg/env"
 	"github.com/e2b-dev/infra/packages/shared/pkg/grpc/orchestrator"
 	"github.com/e2b-dev/infra/packages/shared/pkg/logger"
 	"github.com/e2b-dev/infra/packages/shared/pkg/machineinfo"
@@ -22,6 +25,41 @@
 
 var errSandboxCreateFailed = errors.New("failed to create a new sandbox, if the problem persists, contact us")
 
+// countEligibleNodes returns the count of nodes that pass basic static eligibility criteria.
+// This excludes nodes with wrong status, incompatible CPU, or mismatched labels—factors
+// that don't change during placement retries.
+func countEligibleNodes(clusterNodes []*nodemanager.Node, buildMachineInfo machineinfo.MachineInfo, labelFilteringEnabled bool, requiredLabels []string) int {
+	count := 0
+	for _, node := range clusterNodes {
+		if isNodeEligible(node, buildMachineInfo, labelFilteringEnabled, requiredLabels) {
+			count++
+		}
+	}
+	return count
+}
+
+// isNodeEligible checks static eligibility criteria that don't change during retries.
+// These include node status, CPU compatibility, and label matching.
+func isNodeEligible(node *nodemanager.Node, buildMachineInfo machineinfo.MachineInfo, filterByLabels bool, requiredLabels []string) bool {
+	// Local nodes are synthetic and bypass status/label checks
+	if env.IsLocal() && node.ClusterID == consts.LocalClusterID {
+		return true
+	}
+	// Node must be ready
+	if node.Status() != api.NodeStatusReady {
+		return false
+	}
+	// CPU must be compatible
+	if !isNodeCPUCompatible(node, buildMachineInfo) {
+		return false
+	}
+	// Labels must match if filtering is enabled
+	if filterByLabels && !isNodeLabelsCompatible(node, requiredLabels) {
+		return false
+	}
+	return true
+}
+
 // Algorithm defines the interface for sandbox placement strategies.
 // Implementations should choose an optimal node based on available resources
 // and current load distribution.
@@ -42,6 +80,7 @@
 		node = preferredNode
 	}
 
+	eligibleNodeCount := countEligibleNodes(clusterNodes, buildMachineInfo, labelFilteringEnabled, requiredLabels)
 	attempt := 0
 	for attempt < maxRetries {
 		select {
@@ -61,7 +100,7 @@
 			for id := range nodesExhausted {
 				skip[id] = struct{}{}
 			}
-			if len(skip) >= len(clusterNodes) {
+			if len(skip) >= eligibleNodeCount {
 				return nil, errors.New("no nodes available")
 			}
 
@@ -114,7 +153,7 @@
 			failedNode.PlacementMetrics.Skip(sbxRequest.GetSandbox().GetSandboxId())
 			// Once every node is excluded but some were only capacity-exhausted,
 			// retry the whole exhausted pool since capacity may free up.
-			if len(nodesExcluded)+len(nodesExhausted) >= len(clusterNodes) {
+			if len(nodesExcluded)+len(nodesExhausted) >= eligibleNodeCount {
 				clear(nodesExhausted)
 				attempt++
 			}

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Nil affinity map always triggers Score branch needlessly
- Added nil check to avoid passing nil map as variadic argument, preventing unnecessary affinity scoring branch execution when feature is disabled.

Or push these changes by commenting:

@cursor push 9d641d9600

Preview (9d641d9600)

diff --git a/packages/api/internal/orchestrator/create_instance.go b/packages/api/internal/orchestrator/create_instance.go
--- a/packages/api/internal/orchestrator/create_instance.go
+++ b/packages/api/internal/orchestrator/create_instance.go
@@ -337,7 +337,11 @@
 		affinityScores = o.placementAffinity.scores(ctx, placementCacheAffinityConfig, nodeClusterID, affinityBuildID)
 	}
 
-	node, err = placement.PlaceSandbox(ctx, o.placementAlgorithm, clusterNodes, node, sbxRequest, builds.ToMachineInfo(sbxData.Build), labelFilteringEnabled, team.SandboxSchedulingLabels, affinityScores)
+	if affinityScores != nil {
+		node, err = placement.PlaceSandbox(ctx, o.placementAlgorithm, clusterNodes, node, sbxRequest, builds.ToMachineInfo(sbxData.Build), labelFilteringEnabled, team.SandboxSchedulingLabels, affinityScores)
+	} else {
+		node, err = placement.PlaceSandbox(ctx, o.placementAlgorithm, clusterNodes, node, sbxRequest, builds.ToMachineInfo(sbxData.Build), labelFilteringEnabled, team.SandboxSchedulingLabels)
+	}
 	if err != nil {
 		return sandbox.Sandbox{}, &api.APIError{
 			Code:      http.StatusInternalServerError,

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fdb1b1e768

Do not use template-builder instances as placement nodes when local orchestrator discovery has no candidates.

…2b-dev/infra into HEAD

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 53bfe748ae

Count exhausted-pool retries so a fully saturated cluster exits after the normal retry budget.

…2b-dev/infra into HEAD

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9277fc3cd9

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Exhausted retry counter unreachable due to attempt increment
- Removed the attempt++ increment from the exhausted pool retry path so maxExhaustedRetries (100) can be reached before the outer loop terminates at maxRetries (3).

Or push these changes by commenting:

@cursor push 9277fc3cd9

Let ResourceExhausted pool retries use their own retry cap instead of consuming the create-failure retry budget.

…2b-dev/infra into HEAD

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 202bab4374

Rerequested review.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3d362021fe

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Affinity bonus not normalized by totalCapacity in Score
- Normalized the affinity bonus by dividing it by totalCapacity to match the base score normalization, ensuring consistent placement behavior across nodes with different CPU counts.

Or push these changes by commenting:

@cursor push b7048af342

Preview (b7048af342)

diff --git a/packages/api/internal/orchestrator/placement/placement_best_of_K.go b/packages/api/internal/orchestrator/placement/placement_best_of_K.go
--- a/packages/api/internal/orchestrator/placement/placement_best_of_K.go
+++ b/packages/api/internal/orchestrator/placement/placement_best_of_K.go
@@ -63,7 +63,7 @@
 
 	score := (cpuRequested + float64(reserved) + config.Alpha*usageAvg) / totalCapacity
 	if len(affinityScores) > 0 {
-		score -= affinityScores[0][node.ID]
+		score -= affinityScores[0][node.ID] / totalCapacity
 	}
 
 	return score

cursor · 2026-05-31T05:37:48Z

+	score := (cpuRequested + float64(reserved) + config.Alpha*usageAvg) / totalCapacity
+	if len(affinityScores) > 0 {
+		score -= affinityScores[0][node.ID]
+	}


Affinity bonus not normalized by totalCapacity in Score

Medium Severity

The base placement score is divided by totalCapacity (R * cpuCount), but the affinity bonus subtraction on line 66 is applied as a raw absolute value without the same normalization. The PR discussion explicitly states this was fixed ("normalizing the affinity bonus in Score on the same totalCapacity denominator as the base score"), but the code subtracts affinityScores[0][node.ID] directly from the already-normalized score. This causes the affinity bonus to have a disproportionately larger relative effect on high-CPU nodes (where base scores are smaller) compared to low-CPU nodes.
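A worked example of the concern, using the same expression shape as the diff above. The inputs (load terms, capacities, and a raw bonus of 0.5) are made-up numbers chosen only to make the arithmetic easy to follow.

```go
package main

import "fmt"

// baseScore mirrors the shape of the expression in the diff above:
// (cpuRequested + reserved + alpha*usageAvg) / totalCapacity.
func baseScore(cpuRequested, reserved, alpha, usageAvg, totalCapacity float64) float64 {
	return (cpuRequested + reserved + alpha*usageAvg) / totalCapacity
}

func main() {
	const bonus = 0.5 // hypothetical raw affinity bonus

	// Two hypothetical nodes with the same absolute load but different capacity.
	for _, capacity := range []float64{8, 64} {
		base := baseScore(2, 4, 0.5, 4, capacity)
		fmt.Printf("capacity=%v base=%.4f raw=%.4f normalized=%.4f\n",
			capacity, base, base-bonus, base-bonus/capacity)
	}
	// capacity=8 base=1.0000 raw=0.5000 normalized=0.9375
	// capacity=64 base=0.1250 raw=-0.3750 normalized=0.1172
}
```

In this example the raw bonus removes 50% of the small node's score but 400% of the large node's score, while the normalized bonus removes 6.25% from both.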

dobrac · 2026-05-31T06:50:00Z

+	if len(clusterNodes) == 0 {
+		if nodeClusterID == consts.LocalClusterID {
+			o.discoverNomadNodes(ctx)
+			if len(o.GetClusterNodes(nodeClusterID)) == 0 {
+				_ = o.connectToNode(ctx, nodemanager.NomadServiceDiscovery{
+					NomadNodeShortID:    "local",
+					OrchestratorAddress: net.JoinHostPort("localhost", env.GetEnv("ORCHESTRATOR_PORT", "5008")),
+					IPAddress:           "localhost",
+				})
+			}
+		} else {
+			o.discoverClusterNode(ctx, nodeClusterID)
+		}
+		clusterNodes = o.GetClusterNodes(nodeClusterID)
+	}


what does this fix?

dobrac · 2026-05-31T06:50:00Z

+	if affinityScores == nil {
+		node, err = placement.PlaceSandbox(ctx, o.placementAlgorithm, clusterNodes, node, sbxRequest, builds.ToMachineInfo(sbxData.Build), labelFilteringEnabled, team.SandboxSchedulingLabels)
+	} else {
+		node, err = placement.PlaceSandbox(ctx, o.placementAlgorithm, clusterNodes, node, sbxRequest, builds.ToMachineInfo(sbxData.Build), labelFilteringEnabled, team.SandboxSchedulingLabels, affinityScores)


the nil check here might not be required
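
One possible reason the check is redundant, assuming `PlaceSandbox` takes the scores as a trailing variadic `...map[string]float64` (inferred from the `affinityScores[0][node.ID]` access shown earlier, not confirmed here): reading from a nil map in Go is safe and yields the zero value. The `scoreWithAffinity` function below is a hypothetical stand-in, not the real signature.

```go
package main

import "fmt"

// scoreWithAffinity is a hypothetical stand-in for a scorer that accepts
// optional affinity scores as a variadic parameter.
func scoreWithAffinity(base float64, nodeID string, affinityScores ...map[string]float64) float64 {
	if len(affinityScores) > 0 {
		// Indexing a nil map is allowed in Go and returns 0 for float64.
		base -= affinityScores[0][nodeID]
	}
	return base
}

func main() {
	var none map[string]float64 // nil: affinity disabled or lookup skipped
	warm := map[string]float64{"node-a": 0.25}

	fmt.Println(scoreWithAffinity(1.0, "node-a"))       // no argument at all
	fmt.Println(scoreWithAffinity(1.0, "node-a", none)) // nil map passed explicitly
	fmt.Println(scoreWithAffinity(1.0, "node-a", warm)) // bonus applied
}
```

Under that assumption the first two calls return the same value, so a single call site that always passes `affinityScores` would behave like the branching version.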

feat(api): add cache-aware placement affinity

e7a75c0

Record recent successful placements and use them as a small, gated best-of-K bias so warm nodes are preferred without bypassing placement filters.

cla-bot Bot added the cla-signed label May 30, 2026

gemini-code-assist Bot reviewed May 30, 2026

Comment thread packages/api/internal/orchestrator/placement_affinity.go Outdated

Comment thread packages/api/internal/orchestrator/placement_affinity.go Outdated

ValentaTomas added 9 commits May 29, 2026 18:59

fix(api): batch placement affinity redis calls

57f7c27

Use a single pipeline for affinity reads and writes to keep the placement path bounded when the flag is enabled.

fix(api): resolve placement affinity lint

001c7a9

Use error-aware Redis checks and apply the repository formatter to the new feature flag.

fix(api): make placement affinity tunable

3eb3484

Move cache affinity rollout knobs into the existing flag so Redis TTL, scoring, limits, and timeouts can be adjusted without a deploy.
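
As a rough illustration of flag-driven knobs, the sketch below overlays a JSON flag payload on built-in defaults. The `affinityConfig` type, its field names, JSON keys, and default values are all assumptions made for this example and are not taken from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// affinityConfig is a hypothetical shape for the flag payload; the field
// names and defaults are assumptions, not the PR's actual schema.
type affinityConfig struct {
	Enabled    bool    `json:"enabled"`
	TTLSeconds int     `json:"ttl_seconds"`
	TopN       int     `json:"top_n"`
	MaxBonus   float64 `json:"max_bonus"`
	TimeoutMs  int     `json:"timeout_ms"`
}

// defaultAffinityConfig keeps the feature off unless the flag enables it.
func defaultAffinityConfig() affinityConfig {
	return affinityConfig{Enabled: false, TTLSeconds: 300, TopN: 3, MaxBonus: 0.05, TimeoutMs: 20}
}

// parseAffinityConfig overlays the JSON payload on the defaults, so fields
// missing from the flag keep their default values. On a malformed payload
// it returns the untouched defaults together with the error.
func parseAffinityConfig(raw []byte) (affinityConfig, error) {
	cfg := defaultAffinityConfig()
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return defaultAffinityConfig(), err
	}
	return cfg, nil
}

// TTL converts the integer knob into a duration for the cache client.
func (c affinityConfig) TTL() time.Duration {
	return time.Duration(c.TTLSeconds) * time.Second
}

func main() {
	cfg, err := parseAffinityConfig([]byte(`{"enabled": true, "ttl_seconds": 600}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.Enabled, cfg.TTL(), cfg.TopN)
}
```

Falling back to defaults on a bad payload keeps a malformed flag value from changing placement behavior.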

fix(api): avoid high-cardinality resume affinity keys

497d6cf

Keep resume origin-node affinity as the primary signal and skip per-snapshot build keys so the Redis placement cache stays bounded.

fix(api): skip unused placement affinity reads

c8066f5

Avoid Redis affinity lookups when resume already has a ready origin node and skip disabled affinity dimensions.

fix(api): scope placement affinity by team template

428d6ab

Use team-scoped template and base-template affinity as a low-cardinality fallback for resumes when origin-node placement is unavailable.

fix(api): add low-weight team placement affinity

f20296d

Track team-level placement history as the weakest cache-affinity fallback alongside template and base-template signals.

fix(api): default placement affinity to build keys

17db418

Keep template/team dimensions available for experiments but disable them by default until base build metadata is available.

fix(api): keep placement affinity build-only

922bd06

Remove proxy affinity dimensions so the first rollout only uses concrete build cache signals.

cursor Bot reviewed May 30, 2026

Comment thread packages/api/internal/orchestrator/placement/placement_best_of_K.go

fix(api): skip exhausted affinity nodes

4afd0cc

Exclude nodes that reject placement as exhausted so affinity retries can fall through to other candidates.
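
The fall-through behavior described in this commit can be sketched as follows. The `node` type, `pickNode`, and the numbers are hypothetical, and candidate sampling for Best-of-K is omitted; the point is only that an excluded node is skipped before the affinity bonus can favor it again.

```go
package main

import "fmt"

// node is a hypothetical candidate; Load is a normalized score where lower is better.
type node struct {
	ID   string
	Load float64
}

// pickNode returns the lowest-scoring candidate that is not excluded.
// affinity holds a per-node bonus that is subtracted from the score.
func pickNode(candidates []node, affinity map[string]float64, excluded map[string]struct{}) (node, bool) {
	var best node
	bestScore, found := 0.0, false
	for _, n := range candidates {
		if _, skip := excluded[n.ID]; skip {
			continue // this node already rejected the request as exhausted
		}
		s := n.Load - affinity[n.ID]
		if !found || s < bestScore {
			best, bestScore, found = n, s, true
		}
	}
	return best, found
}

func main() {
	candidates := []node{
		{ID: "warm", Load: 0.60},
		{ID: "cold-a", Load: 0.40},
		{ID: "cold-b", Load: 0.50},
	}
	affinity := map[string]float64{"warm": 0.30}
	excluded := map[string]struct{}{}

	first, _ := pickNode(candidates, affinity, excluded)
	fmt.Println("first choice:", first.ID)

	// Pretend the warm node answered ResourceExhausted: exclude it and retry.
	excluded[first.ID] = struct{}{}
	second, _ := pickNode(candidates, affinity, excluded)
	fmt.Println("after exclusion:", second.ID)
}
```

Without the exclusion set, the affinity bonus would keep steering every retry back to the same saturated node.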

ValentaTomas marked this pull request as ready for review May 30, 2026 21:58

ValentaTomas requested review from dobrac and jakubno as code owners May 30, 2026 21:58

fix(api): keep single-node placement retries

efe533c

cursor Bot reviewed May 30, 2026

Comment thread packages/api/internal/orchestrator/placement/placement.go Outdated

ValentaTomas added 2 commits May 30, 2026 15:30

fix(api): bound exhausted-node placement retries

a466269

fix(api): refresh empty placement node pools

38d8022

chatgpt-codex-connector Bot reviewed May 30, 2026

Comment thread packages/api/internal/orchestrator/create_instance.go Outdated

fix(api): discover local combined orchestrators

35c3f9e

dobrac previously requested changes May 30, 2026

fix(api): connect local orchestrator on empty placement

75d9d73

chatgpt-codex-connector Bot reviewed May 30, 2026

Comment thread packages/api/internal/orchestrator/client.go Outdated

cursor Bot reviewed May 30, 2026

Comment thread packages/api/internal/orchestrator/placement/placement_best_of_K.go

fix(api): allow local placement while metrics settle

5b6375c

cursor Bot reviewed May 31, 2026

Comment thread packages/api/internal/orchestrator/placement/placement.go Outdated

fix(api): refresh local cluster before placement

8a6080c

chatgpt-codex-connector Bot reviewed May 31, 2026

Comment thread packages/api/internal/orchestrator/cache.go Outdated

cursor Bot reviewed May 31, 2026

Comment thread packages/api/internal/orchestrator/placement/placement.go Outdated

ValentaTomas added 2 commits May 30, 2026 18:35

fix(api): drive exhausted-pool retry off node selection failure

08c664d

fix(api): connect local builders for placement fallback

c6e1e0c

cursor Bot reviewed May 31, 2026

Comment thread packages/api/internal/orchestrator/create_instance.go Outdated

fix(api): keep local placement capacity filters

fdb1b1e

chatgpt-codex-connector Bot reviewed May 31, 2026

Comment thread packages/api/internal/orchestrator/client.go Outdated

ValentaTomas added 3 commits May 30, 2026 19:58

fix(api): wait for exhausted placement capacity

eab1dc2

fix(api): ignore builders for placement discovery

cc81426

Do not use template-builder instances as placement nodes when local orchestrator discovery has no candidates.

Merge branch 'valenta/api-placement-affinity' of https://github.com/e2b-dev/infra into HEAD

53bfe74

chatgpt-codex-connector Bot reviewed May 31, 2026

Comment thread packages/api/internal/orchestrator/placement/placement.go

ValentaTomas added 4 commits May 30, 2026 20:05

fix(api): bound exhausted placement retries

4317b91

fix(api): bound exhausted placement retries

aa3228a

Count exhausted-pool retries so a fully saturated cluster exits after the normal retry budget.

Merge branch 'valenta/api-placement-affinity' of https://github.com/e2b-dev/infra into HEAD

dafa27b

chore(api): trim placement affinity tests

9277fc3

chatgpt-codex-connector Bot reviewed May 31, 2026

Comment thread packages/api/internal/orchestrator/placement/placement.go Outdated

cursor Bot reviewed May 31, 2026

Comment thread packages/api/internal/orchestrator/placement/placement.go Outdated

ValentaTomas added 3 commits May 30, 2026 20:46

fix(api): connect local combined service for placement

e6620ef

fix(api): keep exhausted retries separate

b0d874a

Let ResourceExhausted pool retries use their own retry cap instead of consuming the create-failure retry budget.

Merge branch 'valenta/api-placement-affinity' of https://github.com/e2b-dev/infra into HEAD

202bab4

chatgpt-codex-connector Bot reviewed May 31, 2026

Comment thread packages/api/internal/orchestrator/create_instance.go Outdated

ValentaTomas requested a review from dobrac May 31, 2026 05:08

fix(api): avoid builder-only placement fallback

3d36202

chatgpt-codex-connector Bot reviewed May 31, 2026

Comment thread packages/api/internal/orchestrator/placement/placement_best_of_K.go Outdated

fix(api): keep affinity bonus on score scale

b9916c4

cursor Bot reviewed May 31, 2026

dobrac reviewed May 31, 2026
